delay health checks until transport socket secrets are ready. #13516

mpuncel · 2020-10-12T18:27:47Z

This fixes a sequencing issue that could result in undesirable Envoy
bootup times. When active health checks are configured on a cluster
using a transport socket that uses SDS to fetch secrets, it was possible
for the initial health check to run before secrets are loaded, which
results in a failure. The health check would then wait for the
no_traffic_interval (default of 60s) before trying again.

This commit makes health checks wait until secrets are loaded before
starting, implemented by adding support for registering a callback on
TransportSocketFactory. SslSocketFactory supports invoking the callback
when secrets are loaded, and all other socket types currently invoke the
callback immediately, preserving existing behavior.

Signed-off-by: Michael Puncel mpuncel@squareup.com

Commit Message:
delay health checks until transport socket secrets are ready.

This fixes a sequencing issue that could result in undesirable Envoy
bootup times. When active health checks are configured on a cluster
using a transport socket that uses SDS to fetch secrets, it was possible
for the initial health check to run before secrets are loaded, which
results in a failure. The health check would then wait for the
no_traffic_interval (default of 60s) before trying again.

This commit makes health checks wait until secrets are loaded before
starting, implemented by adding support for registering a callback on
TransportSocketFactory. SslSocketFactory supports invoking the callback
when secrets are loaded, and all other socket types currently invoke the
callback immediately, preserving existing behavior.

Additional Description:
Risk Level: Medium
Testing: Unit tests
Docs Changes: N/A
Release Notes: Updated
Fixes #12389

This fixes a sequencing issue that could result in undesirable Envoy bootup times. When active health checks are configured on a cluster using a transport socket that uses SDS to fetch secrets, it was possible for the initial health check to run before secrets are loaded, which results in a failure. The health check would then wait for the no_traffic_interval (default of 60s) before trying again. This commit makes health checks wait until secrets are loaded before starting, implemented by adding support for registering a callback on TransportSocketFactory. SslSocketFactory supports invoking the callback when secrets are loaded, and all other socket types currently invoke the callback immediately, preserving existing behavior. Signed-off-by: Michael Puncel <mpuncel@squareup.com>

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

PiotrSikora · 2020-10-12T19:42:33Z

Drive-by comment, but wouldn't #13344 (keep clusters warming) also solve this?

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

mpuncel · 2020-10-12T20:05:29Z

@PiotrSikora interesting, I was unaware of that PR! From a quick look, I don't think that PR actually delays health checks, so it does not address the problem from #12389. If there was a delay fetching a secret, Envoy would still enter the LIVE state in the middle of a no_traffic_interval which could be a minute long, so there could be up to a minute where Envoy is "ready" but has not performed any "real" health checks (meaning TLS secrets were loaded for it).

I do think this PR might coincidentally fix #11120 for some cases, because the cluster is not considered warmed until health checks have run for one interval. There would still be some edge cases, however, for instance if health checks are performed over a raw_buffer socket that does not need to load secrets, they would still run immediately and the cluster might be marked as warmed before all secrets are loaded. So that PR still has value as well.

mpuncel · 2020-10-12T20:13:34Z

That PR could maybe be modified to delay loadCluster until secrets have loaded though, which I think would then obviate the need for this one. That might block more of the cluster initialization process than is necessary though, such as DNS resolution or EDS

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

include/envoy/network/transport_socket.h

lizan · 2020-10-13T20:08:02Z

@PiotrSikora interesting, I was unaware of that PR! From a quick look, I don't think that PR actually delays health checks, so it does not address the problem from #12389. If there was a delay fetching a secret, Envoy would still enter the LIVE state in the middle of a no_traffic_interval which could be a minute long, so there could be up to a minute where Envoy is "ready" but has not performed any "real" health checks (meaning TLS secrets were loaded for it).

Envoy will still enter LIVE state but the cluster won't be active, then no HC will be there, isn't that the case?

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

* master: (22 commits) http: using CONNECT_ERROR for HTTP/2 (envoyproxy#13519) listener: respect address.pipe.mode (it didn't work) (envoyproxy#13493) examples: Fix more deprecations/warnings in configs (envoyproxy#13529) overload: tcp connection refusal overload action (envoyproxy#13311) tcp: towards pluggable upstreams (envoyproxy#13331) conn_pool: fixing comments (envoyproxy#13520) Prevent SEGFAULT when disabling listener (envoyproxy#13515) Convert overload manager config literals to YAML (envoyproxy#13518) Fix runtime feature variable name (envoyproxy#13533) dependencies: refactor repository location schema utils, cleanups. (envoyproxy#13452) router: fix an invalid ASSERT when encoding metadata frames in the router. (envoyproxy#13511) http2: Proactively disconnect connections flooded when resetting stream (envoyproxy#13482) ci use azp to sync filter example (envoyproxy#13501) mongo_proxy: support configurable command list for metrics (envoyproxy#13494) http local rate limit: note token bucket is shared (envoyproxy#13525) wasm/extensions: Wasm extension policy. (envoyproxy#13526) http: removing envoy.reloadable_features.http1_flood_protection (envoyproxy#13508) build: update ppc64le CI build status shield (envoyproxy#13521) dependencies: enforce dependency shepherd sign-off via RepoKitteh. (envoyproxy#13522) Add no_traffic_healthy_interval (envoyproxy#13336) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>

mpuncel · 2020-10-14T00:54:53Z

@lizan it looks like that PR doesn't change the flow until onClusterInit() is called, which is after the initial round of health checks has completed, so it doesn't solve my problem.

Even after that PR is merged, I believe this problematic sequence is still possible:

Health checks are performed (and fail because secrets aren't loaded)
(new functionality) cluster manager properly recognizes that the cluster isn't "warmed" and waiting on secrets load
secrets load, Envoy goes LIVE, all clusters marked active.
envoy goes LIVE, requests fail with "no healthy upstream". This period could be almost 60 seconds long because of the default for no_traffic_interval
Requests succeed after the next health check interval

That said, that PR could maybe be modified to block for secrets updating before calling cluster.initialize() or something like that, but that will also delay EDS/DNS waiting on secrets that aren't necessary. Alternatively change the cluster API so it can do like cluster.startHealthChecks() after secrets load

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

source/common/upstream/health_checker_base_impl.cc

lizan

Can you merge main branch?

* master: ci: use multiple stage (envoyproxy#13557) tls: update BoringSSL to 2192bbc8 (4240). (envoyproxy#13567) fix macos v8 build (envoyproxy#13572) Fixed Health Check Fuzz corpus syntax (envoyproxy#13576) ci: Remove shellcheck diff (envoyproxy#13560) ci: Increate brew retry interval (envoyproxy#13565) dependencies: fix some of the fallout from Wasm merge. (envoyproxy#13569) hds: add support for delta updates in specifier (envoyproxy#13067) ci: workaround for actions/runner-images#1811 (envoyproxy#13577) ratelimit: be able to disable x-envoy-ratelimited response header sent (envoyproxy#13270) Update opencensus library (envoyproxy#13549) ci: use azp for api and go-control-plane sync (envoyproxy#13550) docs: Remove/make generic lyft references in docs (envoyproxy#13559) check_format: adding 2 more release note checks (envoyproxy#13444) [Wasm] Add cluster metadata fallback and upstream host metadata (envoyproxy#13477) [fuzz] Added validation for secrets (envoyproxy#13543) Add Platform Specific Feature guidance to PR template (envoyproxy#13547) Signed-off-by: Michael Puncel <mpuncel@squareup.com>

mpuncel · 2020-10-15T23:06:14Z

merged!

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

* master: (22 commits) delay health checks until transport socket secrets are ready. (envoyproxy#13516) test, oauth2: Make sure config test runs field validation (envoyproxy#13496) [http] swap codec implementations to default new (envoyproxy#13579) wasm: update proxy-wasm-cpp-host (envoyproxy#13606) postgres: do not copy and linearize received data when it is not going to be used (envoyproxy#13393) configs: Update configs v2 -> v3 (envoyproxy#13562) http2: Remove RELEASE_ASSERTs in sendPendingFrames() error handling (envoyproxy#13546) dependencies: track untracked implied dependencies, wrapup dashboard. (envoyproxy#13571) listener: add match all filter chain (envoyproxy#13449) fix mistakes in docstrings (envoyproxy#13603) ratelimit: add route entry metadata to ratelimit actions (envoyproxy#13269) cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783) ext_authz: Avoid calling check multiple times (envoyproxy#13288) docs: Unexclude remaining configs from validation (envoyproxy#13534) build: update rules_rust to allow Rustc in RBE (envoyproxy#13595) docs: Update sphinxext.rediraffe (envoyproxy#13589) Deprecate moonjit support on Windows before beta (envoyproxy#13541) dependencies: bump LuaJIT to 2.1 branch HEAD @ e9af1ab. (envoyproxy#13474) docs: add TLS stats to cluster stats doc (envoyproxy#13561) ci: stop building alpine-debug images in favor of ubuntu-based debug image (envoyproxy#13598) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>

rgs1 · 2020-10-19T23:24:07Z

x-posting in case someone else bumps into this issue, we had to revert this internally because it breaks (on a significant number of cases) hot restart for us (gets stuck in PRE_INIT).

mattklein123 · 2020-10-19T23:28:15Z

@rgs1 @mpuncel can we get this change reverted while we debug?

…envoyproxy#13516)" This reverts commit 34b67f9. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>

rgs1 · 2020-10-19T23:30:40Z

#13639

mpuncel · 2020-10-20T00:51:02Z

@mattklein123 fine with me! Where is the best place to discuss what might have gone wrong here and how to proceed? Should I open an issue or another PR after the revert is merged?

mattklein123 · 2020-10-20T02:04:08Z

@mattklein123 fine with me! Where is the best place to discuss what might have gone wrong here and how to proceed? Should I open an issue or another PR after the revert is merged?

I would reopen a PR, and then try to work with @rgs1 to debug? I don't immediately see how hot restart would change the behavior of this PR so will need some poking.

…#13516)" (#13639) This reverts commit 34b67f9. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>

* master: (22 commits) ci: various improvements (envoyproxy#13660) dns: fix defunct fd bug in apple resolver (envoyproxy#13641) build: support ppc64le with wasm (envoyproxy#13657) [fuzz] Added random load balancer fuzz (envoyproxy#13400) dependencies: compute and check release dates via GitHub API. (envoyproxy#13582) mac ci: try ignoring update failure (envoyproxy#13658) watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. (envoyproxy#13103) typos: fix a couple 'enovy' mispellings (envoyproxy#13645) lua: Expose stream info downstreamLocalAddress and downstreamDirectRemoteAddress for Lua filter (envoyproxy#13536) tap: fix upstream streamed transport socket taps (envoyproxy#13638) Revert "delay health checks until transport socket secrets are ready. (envoyproxy#13516)" (envoyproxy#13639) Watchdog: use abort action as a default if killing is enabled. (envoyproxy#13523) [fuzz] Fixed divide by zero bug (envoyproxy#13545) wasm: flip the meaning of the "repository" in envoy_wasm_cc_binary(). (envoyproxy#13621) fix: record recovered local address (envoyproxy#13581) docs: fix incorrect compressor filter doc (envoyproxy#13611) docs: clean up docs for azp migration (envoyproxy#13558) wasm: fix building Wasm example. (envoyproxy#13619) test: Refactor flood tests into a separate test file (envoyproxy#13556) wasm: re-enable tests with precompiled modules. (envoyproxy#13583) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>

This commit is a revival of envoyproxy#13516 which aimed to solve envoyproxy#12389. That PR was merged and then reverted due to a deadlock bug, which is still present in this commit. The fix will be applied in a subsequent commit in order to make the resolution clearer to PR reviewers. Signed-off-by: Michael Puncel <mpuncel@squareup.com>

mpuncel requested review from alyssawilk, htuch, lizan, mattklein123 and PiotrSikora as code owners October 12, 2020 18:27

mpuncel added 2 commits October 12, 2020 14:35

fix format, git add release notes

122a8ff

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

fix spelling error

1b1a5b2

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

make check_format.py regex less strict on version history lines

c7c0aea

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

htuch assigned lizan Oct 12, 2020

mpuncel added 5 commits October 12, 2020 21:47

Undo changes to check_format.py

6ed4855

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

fix missing semicolon

df066b9

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

fix a mock with the wrong signature

47075f3

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

fix deadlock running callbacks

24243d8

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

fix another deadlock in addSecretsReadyCb

d3bba47

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

lizan reviewed Oct 13, 2020

View reviewed changes

include/envoy/network/transport_socket.h Outdated Show resolved Hide resolved

mpuncel added 2 commits October 13, 2020 20:38

add more unit test coverage

177e31c

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

PR feedback: rename addSecretsReadyCb to addReadyCb

2cd1719

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

htuch reviewed Oct 14, 2020

View reviewed changes

source/common/upstream/health_checker_base_impl.cc Show resolved Hide resolved

lizan previously approved these changes Oct 15, 2020

View reviewed changes

clear callbacks vector after invoking callbacks

6b736ed

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

mpuncel dismissed lizan’s stale review via 6b736ed October 16, 2020 11:20

lizan approved these changes Oct 16, 2020

View reviewed changes

lizan merged commit 34b67f9 into envoyproxy:master Oct 16, 2020

lizan mentioned this pull request Oct 16, 2020

sds: keep warming when dynamic inserted cluster can't be extracted secret entity #13344

Merged

rgs1 mentioned this pull request Oct 19, 2020

cluster manager: avoid immediate activation for dynamic inserted cluster when initialize #12783

Merged

rgs1 pushed a commit to rgs1/envoy that referenced this pull request Oct 19, 2020

Revert "delay health checks until transport socket secrets are ready. (…

6a5da60

…envoyproxy#13516)" This reverts commit 34b67f9. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>

rgs1 mentioned this pull request Oct 19, 2020

Revert "delay health checks until transport socket secrets are ready.… #13639

Merged

mattklein123 pushed a commit that referenced this pull request Oct 20, 2020

Revert "delay health checks until transport socket secrets are ready. (…

ef125e3

…#13516)" (#13639) This reverts commit 34b67f9. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>

flashyang mentioned this pull request Apr 14, 2021

Should HC activation be delayed until needed secrets are available? #15977

Closed

mpuncel mentioned this pull request Apr 29, 2021

health_checker: Make health check loop wait for any required SDS secrets to be loaded before starting. #16236

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

delay health checks until transport socket secrets are ready. #13516

delay health checks until transport socket secrets are ready. #13516

mpuncel commented Oct 12, 2020

PiotrSikora commented Oct 12, 2020

mpuncel commented Oct 12, 2020 •

edited

Loading

mpuncel commented Oct 12, 2020

lizan commented Oct 13, 2020

mpuncel commented Oct 14, 2020 •

edited

Loading

lizan left a comment

mpuncel commented Oct 15, 2020

rgs1 commented Oct 19, 2020

mattklein123 commented Oct 19, 2020

rgs1 commented Oct 19, 2020

mpuncel commented Oct 20, 2020 •

edited

Loading

mattklein123 commented Oct 20, 2020

delay health checks until transport socket secrets are ready. #13516

delay health checks until transport socket secrets are ready. #13516

Conversation

mpuncel commented Oct 12, 2020

PiotrSikora commented Oct 12, 2020

mpuncel commented Oct 12, 2020 • edited Loading

mpuncel commented Oct 12, 2020

lizan commented Oct 13, 2020

mpuncel commented Oct 14, 2020 • edited Loading

lizan left a comment

Choose a reason for hiding this comment

mpuncel commented Oct 15, 2020

rgs1 commented Oct 19, 2020

mattklein123 commented Oct 19, 2020

rgs1 commented Oct 19, 2020

mpuncel commented Oct 20, 2020 • edited Loading

mattklein123 commented Oct 20, 2020

mpuncel commented Oct 12, 2020 •

edited

Loading

mpuncel commented Oct 14, 2020 •

edited

Loading

mpuncel commented Oct 20, 2020 •

edited

Loading